DAX Eigencharts

Research question

When I was 12 years old, I saw an intraday chart of the German stock index (DAX) for the first time. It was in the local newspaper and came with lots of numbers and figures for different stocks, currencies, commodities etc. All these numbers, and especially the DAX intraday chart, fascinated me at once: it was love at first sight!

Since then I have seen many hundreds of DAX charts. And at some point I had the strange feeling that there is only a limited number of different DAX charts. If that assumption holds true, one could take advantage of it. My idea was the following: if I only saw the first half of a DAX intraday chart, I could try to find similar charts, predict a trend for the second half of the day and bet on stocks going either further up or down.

To do this, I need to answer three questions:

  • Are there characteristic Eigencharts that explain the intraday DAX movement?
  • If yes, how many of these Eigencharts do I need to capture a reasonable amount of the total variance in DAX movements?
  • If dimensionality reduction does not work efficiently, can we at least cluster DAX movements?

My approach

Data

There are plenty of ways to get intraday charts for all kinds of stocks, indices, currencies etc. from the internet. In this project we work with data from BacktestMarket. They provide a history of DAX charts on a daily basis back to 2000. It comes as so-called tick data: for each minute you get the open, high, low and close values plus the volume.

The complete data set is too large to put on GitHub. Therefore we need to focus on a one-year history; I picked the 2018 data.

Methods

As outlined in the research question, we want to look for Eigencharts. This can best be done with principal component analysis (PCA). Accumulating the obtained eigenvalues gives us the share of the total variance captured. This answers the first two questions.

And in the end we want to do some kmeans clustering to answer the third question.

Analysis Part 1 (PCA)

We need to include the following libraries:
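The original library chunk is not shown here. Judging from the functions used later (`ggplot`, `pivot`ing, `grid.arrange`), the setup was presumably something like the following; the exact package list is an assumption:

```r
# presumed library setup (the original chunk is not shown)
library(tidyverse)  # dplyr/tidyr for pivoting, ggplot2 for plotting
library(gridExtra)  # grid.arrange() to stack several plots
```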

The following code chunk can only be run with the original data set. As mentioned above, the complete data set is too large to put on GitHub. In case you want to work with the complete data, feel free to ask me.

Preprocessing Data

We are now ready to further preprocess the data.

  • In this project we focus on intraday data for DAX movements on a minute basis for the year 2018. We picked the closing value, which gives us the DAX value for the end of each minute (dax_2018).
  • We want to obtain a separate time series for each date, thus pivot the minutes to be columns and put in the closing value for each minute (dax_2018_piv).
  • This leaves us with 251 time series, each containing 840 minutes from 01:00:00 to 14:59:00. Remember the time shift of 7 hours, since the data stems from the US (New York time zone).
  • We need to look for missing values and filter only for the 165 complete time series (dax_2018_piv_clean).
  • We then need to sort columns to have an ascending order of minutes (dax_series).
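The steps above can be sketched on a tiny made-up data set; the column names `date`, `minute` and `close` are assumptions about the original data, and `pivot_wider`/`drop_na` from tidyr stand in for whatever the original chunk used:

```r
library(dplyr)
library(tidyr)

# tiny made-up stand-in for the minute-level 2018 data
dax_2018 <- tibble(
  date   = rep(c("2018-01-02", "2018-01-03", "2018-01-04"), each = 3),
  minute = rep(c("01:00:00", "01:01:00", "01:02:00"), times = 3),
  close  = c(12871, 12875, 12880, 12902, NA, 12905, 12950, 12940, 12948)
)

# pivot the minutes to columns, filled with the closing values (dax_2018_piv)
dax_2018_piv <- pivot_wider(dax_2018, names_from = minute, values_from = close)

# keep only the complete time series, i.e. rows without missing values
dax_2018_piv_clean <- drop_na(dax_2018_piv)

# sort the minute columns in ascending order, date column first (dax_series)
minute_cols <- sort(setdiff(names(dax_2018_piv_clean), "date"))
dax_series  <- dax_2018_piv_clean[, c("date", minute_cols)]
```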

Difference Matrix

  • A DAX chart can be fully explained by the starting value plus the absolute difference from minute to minute. So let’s calculate the matrix of absolute differences for each time series (dax_series_diff).
  • To efficiently calculate the minutely differences we simply shift the DAX series by one minute (dax_series_sub1 and dax_series_sub2) and subtract the shifted series.
  • Note: The first column gives us the date.
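In base R, the shift-and-subtract trick can be sketched on a toy matrix (the numeric part only, with the date column already dropped):

```r
# toy matrix: 2 series x 5 minutes
series <- rbind(c(100, 102, 101, 105, 104),
                c(200, 198, 199, 203, 207))

# shift by one minute and subtract: diff in minute j = value[j] - value[j-1]
series_sub1 <- series[, -1]                         # minutes 2..n
series_sub2 <- series[, -ncol(series)]              # minutes 1..n-1
series_diff <- cbind(0, series_sub1 - series_sub2)  # first column is zero
```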

Example minutely Changes

  • Let’s plot an example series of the differences from minute to minute for one DAX intraday chart time series to see what we are working with.
  • For ggplot we need to convert the transposed time series to a data.frame (t_vec_df).

Average DAX Chart

  • As a prerequisite for PCA we now calculate the series of average absolute minutely changes over all 165 DAX series (avg_diff).
  • We then restore the corresponding chart from the series of average differences and plot the chart.
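A minimal sketch of both steps, reusing the toy difference matrix from above (the starting value of 100 is made up for illustration):

```r
# toy matrix of minutely differences (first column zero, as above)
series_diff <- rbind(c(0,  2, -1, 4, -1),
                     c(0, -2,  1, 4,  4))

# average minutely change over all series (corresponds to avg_diff)
avg_diff <- colMeans(series_diff)

# restore the "average chart" by accumulating the average changes
# on top of a chosen starting value
start_value <- 100
avg_chart <- start_value + cumsum(avg_diff)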

PCA

  • To prepare the input data for our PCA analysis we need to center each time series of minutely differences by subtracting the average difference from every time series (dax_series_diff_center).
  • Scaling is not necessary since we explicitly want to capture large volatility in intraday movements as a feature.

Remember: we are looking for Eigencharts. But before we eventually discover them, we perform PCA on the centered time series of absolute minutely changes (dax_series_diff_center). Some remarks on dimensionality are helpful:

  • We started with 165 DAX time series, each of length 840 (dax_series).
  • We then obtained 165 series of minutely changes, effectively of length 839, since the first column is zero.
  • The average minutely change was subtracted, which gave us the centered series (dax_series_diff_center).
  • PCA is done by looking for the eigenvalues and eigenvectors of the covariance matrix (dax_cov).
  • With an input of 165 time series, we can obtain a maximum of 165 non-zero eigenvalues.
  • In our case we find 164 non-zero eigenvalues: centering removes one degree of freedom, so the rank drops by one.
  • There are 839 eigenvectors, but with only 164 non-zero eigenvalues we only need to take the first 164 eigenvectors into account.
  • To make sure we have found valid eigenvalues and eigenvectors, we briefly verify some of their characteristics.
  • The so-called PC scores give us the projection of the original data onto the eigenvectors.
  • However, I had some trouble computing the PC scores: I needed to convert dax_series_diff_center to a data.frame (dax_series_diff_center_df) first and then to a matrix by hand (dax_series_diff_center_matrix).
  • Otherwise R does not let us compute the PC scores.
  • To recover the original matrix of centered minutely-change time series, one needs to multiply the PC scores (pc_scores) with the inverse of the eigenvector matrix.
  • The inverse in this case is just the transpose, since the eigenvectors are orthonormal.
  • Before we finally look at the Eigencharts, we first want to answer the second question: How many Eigencharts do I need to capture a reasonable amount of the total variance in DAX movements?
  • By reasonable we mean more than 75% in our case.
  • To answer the question, we need to accumulate the eigenvalues and divide by the sum over all eigenvalues.
  • Plotting the accumulated variance over the number of necessary eigenvalues confirms that we need at least the first 75 eigenvectors to capture more than 75% of the total variance.
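The whole PCA pipeline described above can be sketched end-to-end on a small random matrix; with 6 centered series there are at most 5 non-zero eigenvalues, mirroring the 164-out-of-165 situation in the text:

```r
set.seed(1)

# toy stand-in for the difference matrix: 6 series x 10 minutes
X <- matrix(rnorm(60), nrow = 6)
X <- sweep(X, 2, colMeans(X))   # subtract the average change per minute

# PCA via the eigendecomposition of the covariance matrix
X_cov        <- cov(X)
eig          <- eigen(X_cov)
eigenvalues  <- eig$values
eigenvectors <- eig$vectors

# PC scores: projection of the data onto the eigenvectors
pc_scores <- X %*% eigenvectors

# recovering the data: multiply scores with the transposed eigenvector matrix
X_recov <- pc_scores %*% t(eigenvectors)

# cumulative share of the total variance captured by the first k components
var_share <- cumsum(eigenvalues) / sum(eigenvalues)
```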

Recover DAX series

Let’s have a look at how we can recover a single DAX time series from an increasing number of eigenvectors. To obtain the original data we need to multiply the PC scores with the transposed eigenvector matrix, as mentioned earlier. But we can also restrict ourselves to the first n eigenvectors and the first n PC scores to produce an approximation. The more eigenvectors we include, the better the approximation. As seen above, recovering a chart from the first 75 eigenvectors captures 75% of the total variance and thus should give us an acceptable approximation of the original DAX chart.

  • We have already seen a plot of the first intraday DAX series above.
  • Now we try to reproduce the first DAX intraday chart using an increasing number of eigenvectors.
  • Using only the first 10 eigenvectors does not fit the original series at all.
  • Taking the first 75 eigenvectors into account, we can recognize the original series.
  • And from all 165 eigenvectors we end up with the exact original data.
  • Since we applied PCA to centered minutely changes, we need to undo the centering of the recovered series by adding back the average change from above (avg_diff).
  • And to restore the DAX chart, minutely changes need to be summed up and the starting value needs to be added.
# helper: recover the 1st DAX series from the first n eigenvectors
recover_series_1 <- function(n) {
  # project back from the first n principal components
  recov <- pc_scores[, 1:n] %*% t(eigenvectors)[1:n, ]
  recov_s1_diff <- recov[1, ]
  # add back the average minutely change (PCA was run on centered data)
  diffs <- avg_diff
  diffs[2:840] <- diffs[2:840] + recov_s1_diff
  # accumulate the minutely changes on top of the original starting value
  chart <- cumsum(diffs)
  chart <- chart - chart[1] + as.double(dax_series[1, 2])
  data.frame(minute = 1:840, dax = chart)
}

# one plot per number of eigenvectors used
plot_recovery <- function(n) {
  ggplot(recover_series_1(n), mapping = aes(x = minute, y = dax)) +
    geom_line() +
    xlab("minute") +
    ylab("DAX chart") +
    ggtitle(paste("from", n, "eigenvectors"))
}

# recover the 1st DAX series from 10, 75 and all 165 eigenvectors
grid.arrange(plot_recovery(10), plot_recovery(75), plot_recovery(165), nrow = 3)

PCA Conclusion

We started with 165 time series of intraday DAX movements. As we have seen, we need at least 75 eigenvectors to capture a reasonable amount - in this case 75% - of the total variance. Before the analysis I had hoped to end up with - let’s say - ten Eigencharts to explain the world of intraday DAX charts and to predict movements for new charts that are not part of the test set. This does not work: dimensionality reduction is not as efficient as I expected.

Analysis Part 2 (kmeans)

The first part of the story ends here. Explaining the universe of intraday DAX charts by only a few Eigencharts failed. But are there similarities among all these DAX charts? Can they be clustered? And how do we want to define the similarity of two DAX charts?

One way to start is the following: we split each daily DAX chart into two parts: the first 7 hours of trading (a.m.) and the second 7 hours (p.m.).

Preprocessing the data

  • Let’s start over with all 165 intraday DAX charts from the first part of this project (dax_series).
  • We create a new table (dax_movement), keep only the date column and calculate the DAX change in the first half (01:00:00 - 08:00:00) and in the second half (08:00:00 - 14:59:00) of the trading day.
  • Call the first half am_movement and the second half pm_movement.
  • Visualizing the data can best be done with a scatter plot with am_movement on the x-axis and pm_movement on the y-axis.
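A minimal sketch of this preprocessing on made-up data; the three minute columns `m1`/`m2`/`m3` are hypothetical stand-ins for the 01:00:00, 08:00:00 and 14:59:00 columns of the real dax_series:

```r
library(ggplot2)

# toy stand-in for dax_series: date column plus one column per minute
dax_series <- data.frame(
  date = c("2018-01-02", "2018-01-03", "2018-01-04"),
  m1   = c(12871, 12902, 12950),   # 01:00:00 (start)
  m2   = c(12900, 12880, 12990),   # 08:00:00 (half-time)
  m3   = c(12940, 12860, 12970)    # 14:59:00 (close)
)

# DAX change in the first and second half of the trading day (dax_movement)
dax_movement <- data.frame(
  date        = dax_series$date,
  am_movement = dax_series$m2 - dax_series$m1,
  pm_movement = dax_series$m3 - dax_series$m2
)

# scatter plot: a.m. movement on the x-axis, p.m. movement on the y-axis
p <- ggplot(dax_movement, aes(x = am_movement, y = pm_movement)) +
  geom_point()
```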

kmeans

Let’s look for homogeneous subgroups in dax movements by performing kmeans clustering. kmeans clustering is a method for partitioning the observations into k groups. The number of clusters k needs to be pre-specified. The kmeans algorithm produces a large number of return values, notably:

  • cluster: the cluster to which each chart is allocated.
  • centers: a matrix of cluster centers.
  • withinss: within-cluster sum of squares (= the sum of the squared Euclidean distances to the cluster center)
  • tot.withinss: Total within-cluster sum of squares (= the sum of the withinss).
  • size: The number of charts in each cluster.

We will use the Lloyd algorithm as discussed in class. There are other algorithms, and the major difference between them is how they do the initial assignment of data points to clusters. The Lloyd algorithm selects random DAX movements and defines them as the initial cluster centers. All other DAX movements are then assigned to these cluster centers such that the within-cluster variance is minimized. Then the cluster centers are re-calculated. This represents one iteration of the Lloyd algorithm.

We increase the maximum number of iterations iter.max from the default of 10 to 30 to ensure that the algorithm converges. And we repeat the cluster assignment 20 times by setting nstart, which avoids ending up in a local minimum of the total within-cluster sum of squares.
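As a self-contained sketch of such a kmeans call (on random toy data instead of the real am/pm movements):

```r
set.seed(42)

# toy 2-d stand-in for the (am_movement, pm_movement) pairs
movements <- matrix(rnorm(200), ncol = 2)

km <- kmeans(movements,
             centers   = 3,        # pre-specified number of clusters k
             algorithm = "Lloyd",  # as discussed in class
             iter.max  = 30,       # raised from the default of 10
             nstart    = 20)       # 20 random restarts against local minima

km$size          # number of observations per cluster
km$tot.withinss  # total within-cluster sum of squares
```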

## [1] 1202976
## [1] 1202587
## [1] 1202976
## [1] 1202662
## [1] 1202662
## [1] 1202662
## [1] 1202662
## [1] 1202976
## [1] 1202662
## [1] 1453188
## [1] 1202662
## [1] 1202662
## [1] 1204105
## [1] 1203963
## [1] 1202662
## [1] 1202662
## [1] 1202662
## [1] 1202662
## [1] 1202662
## [1] 1202976

To find the optimal number k of clusters, we use the function wss_fct from class and apply the elbow method.
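wss_fct itself is not shown in this post; a minimal version might look like the sketch below (an assumption about the class implementation, run here on toy data, whereas the printed values stem from the real DAX movements):

```r
set.seed(42)
movements <- matrix(rnorm(200), ncol = 2)

# total within-cluster sum of squares for k = 1..k_max
wss_fct <- function(data, k_max) {
  sapply(1:k_max, function(k) {
    kmeans(data, centers = k, algorithm = "Lloyd",
           iter.max = 30, nstart = 20)$tot.withinss
  })
}

wss <- wss_fct(movements, 15)

# elbow method: plot wss against k and look for the kink
plot(1:15, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
```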

## [1] 2569798
## [1] 1825746
## [1] 1202587
## [1] 941697.4
## [1] 755044.2
## [1] 632497.8
## [1] 533795.8
## [1] 469278
## [1] 421830.6
## [1] 376440.8
## [1] 351764.1
## [1] 317566.6
## [1] 287824.6
## [1] 272852.8
## [1] 264460.8

Three or four clusters seem to be the optimum in this case. We continue with three clusters here, add the cluster assignment to the data (dax_movement_cluster) and plot the clustering results.

kmeans Conclusion

We have divided the daily DAX movements into two halves: a.m. and p.m., referring to the first and last 7 hours of a trading day, respectively. This could be used for simple trading strategies: knowing how the DAX has performed in the first half, one could bet on the performance in the second half, for instance with so-called knock-out certificates.

However, the results we obtained in our clustering analysis are not useful for several reasons:

  • Although we find three clusters to be the optimum, the data looks more like ONE big diffuse cluster, since the clusters are not separated here.
  • There is a big range in both am_movement and pm_movement, and the movements seem to more or less average out.
  • We do not see a clear tendency for the DAX to move in any direction; it seems to be randomly distributed.
## [1] -68.22277
## [1] -5.584158
## [1] 66.91406
## [1] -6.148438

Outlook

With the methods we learned in class, I am finally able to analyse the behaviour of daily stock market indices. My ideas of finding a small number of Eigencharts to explain the universe of DAX movements, as well as clustering DAX performance by a.m. and p.m. movements, have failed. But I have some more ideas in my head. And: if it was easy to predict stock markets, many people would make huge profits. That would be boring… ;-)

Marco Landt-Hayen

12 12 2019